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Abstract: The bootstrap is a popular and powerful method for assessing precision 
of estimators and inferential methods. However, for massive datasets which are in¬ 
creasingly prevalent, the bootstrap becomes prohibitively costly in computation and 
its feasibility is questionable even with modern parallel computing platforms. Recently 
'Kleiner. Talwalkar. Sarkar. and Jordan! ( 20141) proposed a method called BLB (Bag of 
Little Bootstraps) for massive data which is more computationally scalable with little 
sacrifice of statistical accuracy. Building on BLB and the idea of fast double boot¬ 
strap, we propose a new resampling method, the subsampled double bootstrap, for 
both independent data and time series data. We establish consistency of the sub¬ 
sampled double bootstrap under mild conditions for both independent and dependent 
cases. Methodologically, the subsampled double bootstrap is superior to BLB in terms 
of running time, more sample coverage and automatic implementation with less tuning 
parameters for a given time budget. Its advantage relative to BLB and bootstrap is 
also demonstrated in numerical simulations and a data illustration. 
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1 Introduction 

In the past decade, we have witnessed massive data (or big data) generated in many 
holds. Datasets grow in size in part because they are increasingly being collected by 
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ubiquitous information-sensing mobile devices, remote sensing technologies, and wire¬ 
less sensor networks, among others. Although our computing power has also been 
advancing steadily, the surge of massive data presents challenges to both computer 
scientists and statisticians in terms of data storage, computation and statistical anal¬ 
ysis. As nicely summarized in Jor d an] (120130 . a key question for statistical inference in 
the massive data context is “Can you guarantee a certain level of inferential accuracy 
within a certain time budget even as the data grow in size”? From a statistical point 
of view, there is a great need for new methods that are theoretically sound and remain 
computationally feasible even for massive data sets. The classical theoretical criteria 
to assess the quality of an inferential procedure such as mean squared error, size/power 
are still relevant, but for massive data, computational efficiency and algorithm qual¬ 
ity are also important considerations in comparing different statistical methods and 
procedures. 

With any statistical inference method, an inextricably associated problem is to 
assess the precision of that inference, and this remains important for the statistical 
analysis of massive data sets. For example after parameter estimation from a data set, 
a natural next step is to measure how precise the estimation method is, and this can 
be measured by t he me an squared error, width of confidence interval, and so on. The 
bootstrap (lEfronl (119791 ) ) is a powerful and popular procedure that can be applied to 
estimate precision for a wide variety of inference methods. It has well-known statis¬ 
tical properties including consistency and higher-order accuracy under quite general 
settings. It is conceptually appealing as it is straightforward to implement, using re¬ 
samples from the data as a proxy for samples, and is automatic in nature such that 
the user can implement it without advanced statistical knowledge. However, the ben¬ 
efits of bootstrap come at a considerable computational cost. Each iteration of the 
bootstrap involves the calculation of a statistical function on a resample of the original 
data. For a data of size n, on average each resample includes 0.63n distinct sample 
points — therefore each iteration of the bootstrap carries a computational cost of the 
same order as that of the original inference on the data. Even though this problem can 
be alleviated with the advent of modern parallel computing platforms, it is still quite 
overwhelming to repeatedly process such resampled datasets for data of huge size, say 
a terabyte. Therefore, this calls for new bootstrap methods that are computationally 
scalable while maintaining good statist ical properties. _ 

In their recent work Kleiner. Talwalkar, Sarkar, and Jordan (2014) introduced a 
new resampling procedure called Bag of Little Bootstraps (BLB, hereafter). This 
procedure consists of randomly selecting small subsets of the data, and then performing 
a bootstrap on each subset, by constructing weighted resamples of the subset such 
that the resample size equals the size of the original data. The estimator is calculated 


method bears some resemblance to the traditional subsamplin 

g 

(Politis and Romano 

(1994aB or m out of n bootstrap (Bickel. Gotze, and van Zwet 
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-997j)), which involve 


2 




















subsamples or resamples of size much smaller than the bootstrap, thereby reducing the 
computational cost. However, these methods (subsampling or m out of n bootstrap) 
usually require a rescaling of the output, to adjust for the difference between sample 
size and resample or subsample size. This feature makes them less user-friendly, since 
in order to evaluate the precision of an estimation method, the practitioner typically 
needs to know the rate o f convergence of the estimator being used. Additionally, as 
demonstrated in IKle iner et ah (2014), the performance of subsampling or m out of n 
bootstrap depends quite strongly on the choice of parameters such as subsample size. 
By contrast, the resamples in BLB are of the same size as the data, so no rescaling 
of output is needed thereby retaining the automatic and user-friendly nature of the 
bootstrap. On the other hand, although the resamples are nominally of the same size 
as the original data, they contain only a small number of distinct points coming from 
the subset, which reduces the computational cost of calculating a statistical function 
of the resamples. The estimates of precision from a few subsets can be averaged to 
obtain the BLB estim ate o f precision. 

(2014) the authors recommend a large number of resamples from 


In Kleiner et al. 


each subset, and a small number of random subsets. However, this means that only a 
small fraction of the original data is used in computing the BLB estimate, as a large 
majority of data points may not appear in any of the subsets used. Additionally, run¬ 
ning a large number of resamples on each subset might incur high computational costs, 
even if each resample has less runtime than bootstrap resamples. These two issues can 
affect the performance of BLB in terms of statistical accuracy and computational cost, 
respectively. 

When facing the trade-off between statistical accuracy and computational cost, a 
practical question we need to answer is: “given a certain computation time budget, how 
can a practitioner optimally use that budget to come up with an estimate of precision?” 
The bootstrap has an obvious answer to this question — keep taking resamples until 
the budget runs out. This answer holds true irrespective of the statistical inferential 
method whose precision is of interest. However, it is not obvious how to answer this 
question for BLB, since it is not clear how to optimize the two tuning parameters 
- namely number of resamples per subset and the number of subsets, under the time 
budget constraint. Two natural strategies would be — with a fixed number of resamples 
per subset use as many random subsets as possible, or with a fixed number of random 
subsets use as many resamples per subset as possible. Both strategies might be sub- 


optimal in practice, depending upon the particular problem at hand. Kleiner et ah 


(2 0141) suggest a novel adaptive method for selecting the tuning parameters, where 
one first fixes a tolerance parameter, and then for each subset, one can keep taking 
resamples till that tolerance level is reached. This method provides a nice way of 
adaptively choosing tuning parameters for a given level of desired accuracy. However 
for a given computational time budget, the variability of the precision estimate is not 
known a priori, and hence it is not clear how to choose an appropriate value of the 
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tolerance parameter that is neither too ambitious nor too conservative for the inference 
method of interest. 

In this article we present a new resampling procedure called the Subsampled Dou¬ 
ble B oots t rap (SDB, hereafter) for massive data. Double bootstrap was first proposed 
by Beranl (119881) as a way of improving the accuracy of bootstrap, but is considerably 
more expensive than bootstrap and becomes computationally infeasible for massive 
data. Fast d ouble bootstrap (FDB, hereafter), which was independently proposed by 
White! ( 2000 1 and Davidson and MacKinnon ( 2000l 2007h . is an interesting alternative 


that only resamples once in the second stage of bootstrapping and can dramatically 


rics, see 


Davidson and MacKinnon 

(2002), 

Ahlgren and Antell ( 

2008), Richard (2009), 

others. Recently, Giacomini. Politis, and White 

(2013 

) applied the idea of 


FDB to reduce the computational cost in running Monte carlo experiments to assess 
the performance of bootstrap estimators and tests. They demonstrated the consis- 
tency of this me t hod and called it a ‘warp-speed method’ to emphasize its rapidness. 


Chang and Hall (2014) recently studied the higher order accuracy of FDB in terms of 


bias correction and coverage accuracy of confidence intervals. In the massive data con¬ 
text, the FDB is still too expensive since its computational cost is about twice the cost 
of bootstrap. Therefore we propose to do subsampling first and then apply the idea of 
a single resample in the double bootstrap step to the randomly drawn subset of mas¬ 
sive data, to evaluate the precision of a statistical inference method. Since our method 
is a combination of subsampling and double bootstrap, we call it subsampled double 
bootstrap (SDB). In the implementation of SDB, we randomly draw a large number 
of small subsets of the data, but instead of bootstrapping the subsets we construct 
only one resample from each subset. Since these resamples have the same nominal 
size as the original data but only a small number of distinct points, SDB retains the 
automatic nature and computational strength of BLB. The ensemble of resamples is 
then used to estimate the precision of the inference method, in the same manner as 
bootstrap. Note that SDB inherits certain features from FDB but is computationally 
much cheaper than FDB. The number of distinct points in the first-stage subsample 
and the second-stage resample of SDB are small compared to the number of distinct 
points in the first-stage and second-stage resamples of FDB, and this makes SDB much 
faster. 

To see the statistical and computational advantages of SDB, note that the esti¬ 
mation time of one SDB iteration is comparable to that of two resamples for a BLB 
subset, and hence SDB can cover a large number of random subsets in the time it takes 
BLB to complete a large number of resamples for a single random subset. Hence, SDB 
can provide a much more comprehensive coverage of the data than BLB within a given 
time budget. Further, given a certain computational time budget, utilizing that budget 
with SDB is straightforward as it does not require the choice of any tuning parameters. 
The practitioner can, just like bootstrap, simply keep running subset-resamples until 
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the time budget runs out. 

The rest of the article is organized as follows. In Section [2] we describe SDB in 
independent data setting. Section [3] demonstrates the consistency of SDB for inde¬ 
pendent data, and Section [I] reports two simulation studies comparing SDB, BLB, 
and bootstrap for independent data. We introduce a time series version of SDB in 
Section [5j Section [6] demonstrates consistency for the dependent case, and Section [7] 
reports two simulation studies for time series data. We provide a data illustration on 
a large meteorological time series dataset in Section [HJ and the article concludes with 
discussion in Section |9l Proofs of the theoretical results are in the Appendix and some 
simulation figures are in supplementary materials. The R codes used for simulation 
and data analysis are also in supplementary materials, as well as the dataset analyzed. 

2 SDB for independent data 

Consider an i.i.d. sample X n = (Aj,..., A%} drawn from some unknown distribution 
P. The parameter of interest is 9 = 9(P), for which an estimate 6 n = 9(X n ) is 
obtained from the sample. (Please see the discussion following Theorem 13.11 for a 
more rigorous definition of the types of parameter and estimator covered under the 
scope of SDB.) Having chosen the estimator, the statistician often seeks to obtain 
further information regarding the precision of the estimator 9(X n ). This requires the 
estimation of some measure involving the sampling distribution of 9(X n ) and the true 
value of the parameter 9. For example, the precision of an estimator can be measured 
by the mean squared error or the width of a 95% confidence interval for 9. 

Such measures of precision can usually be defined in terms of a root function 
T n (9 n ,9) involving the estimator and the parameter. Let Q n = Q n (P) be the un¬ 
known sampling distribution of T n , and assume that the precision measure can be 
represented as £(Q n ) for a suitable functional £(•). For example, suppose 9 is the 
population mean, 9(X n ) is the sample mean, and the measure of interest is the scaled 
MSE nE[{9 n — 9) 2 ]. In our notation, we define the root as T n (Q n , 9) = \fn{9 n — 9), and 
define the functional as £(Qn) = f x 2 dQ n (x), where Q n is the sampling distribution of 
T n (9 n ,9). 

Estimation of £(Q n ) can be performed by a resampling method like bootstrap. 
Let P n be the empirical distribution of the sample X n , then we can approximate 
Q n (P) by Q n = Q n (Pn) • To do so, we generate a large number (R) of resamples 
X = (Xjj, ..., Xj n },j = 1,..., R from the observed sample X n . Treating the origi¬ 
nal estimate 9(X n ) as the population parameter and the resample estimate 9(X*Q as 
an estimated value of this parameter, we compute the root T n (9(X*Q,9(X n )) for each 
resample, and obtain the empirical distribution Q H) r of this ensemble of roots. Condi¬ 
tionally on X n , the empirical distribution Q n ,R converges to the resampling distribution 
Q n (P n ) as R goes to infinity. The underlying idea of the bootstrap is to estimate the 
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unknown sampling distribution Q n of the root function by this empirical distribution 
Qn,R, and estimate the measure £(Q n ) by the plug-in estimator ^(Q h ,r)- 

In conventional bootstrap, each resample contains an average of 0.63n distinct 
sample points — the computational cost of calculating each resample estimate 9(X*) 
is therefore comparable to those of the original sample. Running R iterations requires 
performing this task R times, which can be computationally demanding for massive 
datasets, particularly when it involves computation of complex statistics. This limits 
the application of bootstrap for massive datasets. 

For BLB, we fix a subset size b (typically b = n 7 for some 0 < 7 < 1) and construct 
a suitable number (S) of random subsets, X*\ = {X n , ..., X, Jb } , j = 1 ,... ,S, from 
the observed sample X n . Next, for each subset X* J b , we generate R weighted resamples 
of size n — this can be represented by (X* 3 b , W*^), k = 1,..., R where W*^ = 
{Wi, ■ ■ ■ , W b } is a vector representing the frequencies of (X*\) in the k th resample. 

The weight vector W*^’^ is generated from a multinomial distribution with n trials 
and b cells with uniform chance for each cell, independently of the subset. Treating 
0(X*\) as the population parameter and the resample estimate 6(X*\, W*^^) as an 

estimated value of this parameter, we compute the root T n (6(X*\, d(X*\)) for 

each resample, and obtain the empirical distribution Q 3 n b R of this ensemble of roots. In 
the same spirit as bootstrap, we apply the plug-in estimator £,(Q J nbR ) for each subset, 

and average over different subsets to obtain the estimator ^ Y2j=i £(Qn,b,n) °f €(Qn)- 

We propose a subsampled double bootstrap scheme (SDB) based on subsets in the 
same manner as BLB, but using only one resample per subset. We fix a subset size 
b and construct a large number (S) of random subsets, X*\ = {Xj 1 ,..., Xj b }, j = 
1 ,,S, from the observed sample X^. However, we generate only one resample from 
the j th subset, corresponding to as defined above, and calculate a single root 

T* 3 = T n (0(X* 3 h , W* j’ 1 '*), &(X* 3 h )) from the resample estimate and the subset estimate. 
With this ensemble { R* 3 : j = 1,..., S}, we compute the empirical distribution Q n) b,s 
of roots and estimate i{Q n ) using the plug-in estimator £(Q n ,b,s)- Algorithm [Tj outlines 
the computational steps involved. 

Note that the computational advantages of SDB and BLB relative to the bootstrap 
are applicable only when the estimator 9 of interest can take the weighted data rep¬ 
resentation as its argument. This property holds for a large class of commonly used 
estimators, including M-estimators. For a subset X* b with resample weights W* jb , 
the resample estimate can then be expressed as 9(X* b , W* b ). Since BLB and SDB 
resamples have nominal size n but only 0(b) distinct points, computing the resample 
estimate for these methods is much cheaper than that for bootstrap, which has 0(n ) 
distinct points in the resamples. 
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Input : Data X n = {Xi,. .. ,X n } £(•): measure of accuracy 

6: parameter of interest b: subset size 

6 n : estimator S: number of subsets 

T n ( 6 n , 6)\ root function 

Output: £ (Qn,b,s)■ Estimate of £ 

for j <— 1 to 5 do 

(i) Choose random subset X* 3 b from X n 

(ii) Compute 0(X* 3 b ) from X* j b 

(iii) Generate resample (X*^, from X*\ 

(iv) Compute resample estimate 6(X* j h , from (X* j b , 

(v) Compute resample root: T* 3 = T n (9(X*^W*^ ) ’ 1) ),9(X*^ b )) 
end 

1. Compute empirical distribution of roots: Q n ,b,s = ecdf of (T* 1 ,..., T* s } 

2. Calculate the plug-in estimator £(Q n ,b,s) 

Algorithm 1: SDB algorithm 


2.1 Comparison of SDB, BLB, and Bootstrap 


For both BLB and SDB, the resample estimation step applies to a resample with 0(b) 
distinct points, whereas in bootstrap the resample has O(n) distinct points. This makes 
SDB and BLB computationally much cheaper than bootstrap when b « n. Denote 
the computational time for performing the estimation process 6 on a sample of size m 
by t(m). In this formulation of computational time we focus on sample size to illustrate 
the resampling methods, and ignore other factors affecting computational time. For 
an estimator that can take the weighted data representation, the estimation time for 
a resample with nominal size n but only b distinct points is t(b). The estimation time 
for bootstrap, BLB, and SDB, for conducting inference in one original data sample, 
are listed in Table [[] where the symbols have the same meaning as earlier. Bootstrap 
requires estimation on the original data and its R resamples. Each BLB subset requires 
estimation on the subset and its R resamples. Each SDB subset requires estimation 
on the subset and t he single resample. 

For BLB, Kleiner et ah ( 20141) recommends R = 100 and a small value of S (2-10 
depending on b). For illustration, let n = 100,000 and b = n 0 6 , then the number of 
distinct points in each resample is at most 1000, resulting in much faster computation 
than bootstrap. However, in terms of sample coverage, each subset can cover at most 
1% of the data, so 10 subsets can at best cover 10% of the data at an expense of 
1010 xt(b). The SDB can run more than 500 different subsets at the same expense, 
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Name 

Estimation Time 

Bootstrap 

(R + 1) x t(n) 

BLB 

S(R + 1) x t(b) 

SDB 

2 S x t(b) 


Table 1: Estimation time for different resampling methods 


providing a far more comprehensive coverage of the data. 

Further, given a certain time budget, it is not clear how to choose the tuning pa¬ 
rameters R and S th at wi ll provide optimal statistical accuracy for BLB. The adaptive 
method proposed by Kleiner et ah ! 1 2014 /) provides an interesting alternative by choos¬ 
ing a tolerance parameter e instead of R and S. But even then, it is not clear how to 
choose an appropriate e in practice, since t(m) is not known a priori, and neither do we 
know the estimation variability as a function of sample size. For the SDB (with a given 
subset size) and the bootstrap, the estimation time involves only one parameter, the 
number of resamples (or subsets), and hence the practitioner can simply keep running 
resamples until the time budget runs out. 


3 Theory for independent data 


In this section, we provide a theoretical analysis of the SDB in a general empirical 
process setting. Consider a class of functions T [each element mapping from to M]. 
Denote by the space of bounded functions which map from T to M. To describe 

consistency of the SDB, consider the SDB-process 




b 

J2( W i,n-n/b)f(X R - l{i) ). 

i= 1 


Here, W := (W hn ,..., W b , n ) ~ Multinomial*, (n, 1/b ,..., 1/b) independent of X*,..., X n 
and R follows a uniform distribution on the permutations of {1, ...,n} and is inde¬ 
pendent of Xi, ..., X n , W. Note that in empirical process settings, it is important to 
specify the underlying probability space. This is done in the mathematical appendix 
[see Section 110. lj . In order to show that the SDB ‘works’ in a process setting, we 
need to establish that the distribution of the SDB process G„ b [conditional on the 
observations X*] is close to the distribution of the empirical process G n where 

1 n 

M/):=^£(/(V)-E[/(V,)]) 

Jn “ 


when both are viewed as elements of £ oc (J~’). To this end we show that the SDB-process 
converges in distribution, conditionally on the data Xi, ...,X n , to the same Gaussian 
process as the empirical process G n . 
















Theorem 3 . 1 . Assume that T is a Donsker class for P, that X t ~ P are i.i.d. and 
that additio nally \= {f — q : f, q £ J r ,P(f — g) 2 < 5} is measurable in the sense 
discussed in Gine and Zinn 1 198^) for each 5 > 0. Then we have for min(n, b) —> oo 


&n, b 


W,R 


in where G denotes a centered Gaussian process with covariance E[G(/)G(g)] = 

cov(f(X),g(X)). 

p 

In the above Theorem, conditional weak convergence is in the sense described 

W,R 

Kosorokl (2008), Section 2.2.3. A proof of this result can be found in the mathemat¬ 


m 


ical appendix [Section flO-lj . 

One remarkable fact about Theorem 13.II is that, in addition to T being P-Donsker, 
the only requirement on the class of functions J- is a mild measurability condition. This 
means that the SDB on a process level ‘works’ whenever the corresponding functional 
central limit Theorem holds true (up to the mild measurability assumption on the 
function class J -), i.e. the SDB can be applied in a very wide variety of settings. The 
proof relies on basic tools from empirical process theory [in partic ul ar, a fundamental 
result on the exchangeable bootstrap, see Theorem 3.6.3 in Ivan der Vaart and Wellnerl 
f 19961 )]. but is completely different from the proof of Theorem 1 in Kleiner et al.lfl2Q14h . 
The reason is that in the BLB one initial subset is fixed, while in the SDB a different 
subset of the data is used in each iteration. The latter fact poses additional challenges 
for the theoretical analysis of SDB. 

Theorem 13 .1 1 provides a fundamental building block for the analysis of SDB. Com¬ 
bined with the continu ous mapping theorem and functional delta method for the boot¬ 
strap [see for instance Kosorok ( 20 08). Theorem 10.8 and Theorem 12.1], it can be 
utilized to validate the consistency of SDB for a wide range of applications. For illus¬ 
tration purposes, let us consider an application of the functional delta method for the 
bootstrap with the root T n {9 n ,9 ) := \/n(9 n — 9). Assume that we are interested in 
conducting inference on a parameter 9 which can be represented as 0((/ P/)/ e jr), 

and the estimator takes the form 9(X) = 0((/ i —> P n /)/ e j-) for a suitable map 0. More 
precisely, we assume that 0 satisfies the following condition 

(H) There exists a V which is a vector space with V C £ 00 (J r ) such that the sample 
paths of G lie in V with probability one. The map 0 : —> M. k is com¬ 

pactly differentiable tangentially to V in the point PI : / h->- Pf. Denote the 
corresponding derivative by (j>' H . 

For / G J 7 , write P n ,b/ := y Yli =i ^i,nf{X R - ip)). Then, in the notation from Section[ 2 ] 

we have 9(Xff b , W,! ^’ 1 ^) = 011/ ^ P n,bf) /gt)- Now the delta method for the bootstrap 
[Theorem 12.1 in K osorok ( 120081 )] yields for min(n, b) —> oo 

Ud(Ki. Kf ’), hXn)) = >)) - »(*„)) 


W,R 


f H G. 
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At the same time, the classical functional delta method yields 


T n (9(X n ),9) = sfti(9(X n ) - 0) <f>' H &. 


Assume measurability of 9((X*^ b , 6(X n ). Write C for the distribution of p' H G, 

C n for the distribution of T n (0(X n ),0), and denote by C^ b (R, W) the distribution of 
T n (9(X*p h , 6(X n )) conditional on R , W. Denoting by d a metric on the space of 

distributions on which metrizes weak convergence, we have proved that d(£ n , C) —> 

0 as n —> oo and d(£f b (R, W), C) — > 0 in outer probability as min(n, b) —> oo. In 
particular, this shows that for any map £ from the space of distributions to which is 
continuous in the point C with respect to the metric d, we have £(£f fe (-R, W "))—£(£) —>■ 
0 in outer probability. This shows that the conclusion of Theorem 1 in 
(2 0141 ) continues to hold in the SDB setting. 


Kleiner et al. 


4 Simulation study for independent data 

In this section, we report two simulation studies comparing the performance of boot¬ 
strap, BLB, and SDB in la rge s imulated datasets in the i.i.d. framework. We used 
model settings similar to Kleiner et ah (2014). Since they have already demonstrated 
that BLB performs better than the m out of n bootstrap and subsampling, we did not 
include these methods in our study. 


4.1 Multiple Linear Regression 

Consider a d- dimensional multiple linear regression model 

Vi = Pix itl + ... + /3 d x i4 + e-i 

for i = 1,... ,n. Our parameter of interest is the d- dimensional vector of slope co¬ 
efficients, whose true value is (3 = (Pi,..., Pd) — (1, • • •, l)h We use the usual OLS 
estimator p. We also want to construct a simultaneous 95% confidence region for p. 
Traditionally we use the F-statistic 


Tn(P,P) = 


(P-p)'X'X(P-P)/d 

(y-XP)'(y-XP)/(n-d-l) 


to construct the joint confidence region. Let g 0 .95 be the 95% quantile of the true (un¬ 
known) distribution of T n (P, P). Then the confidence region is given by {P : T n (P, P) < 
W95}• br general the true distribution of T n , and hence its quantile go. 95 , is unknown. 
But it can be estimated by the resampling techniques described in the previous section, 
with £,(Q n ) = <7o.95 where Q n is the true distribution of T n . 
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In our simulations, we use a model from the simulation study of iKleiner et al. 
( 2014 ). We generate x t , 3 ~ t 3 and e* ~ 1V(0,100) independently. For normally dis¬ 
tributed errors, we know that T n ~ F(d,n — d — 1), and hence the true quantiles are 
given by those of the corresponding F distribution. We define the error rate as 


| Q0.95 
Q0.95 


- II 


where q and q represent the estimated and true quantiles of T n , respectively. We use 
subset size b = n 7 with 7 = 0.6, 0.7, 0.8 for both BLB and SDB, and let n=100000, 
d=100. Following Kleiner et ah (2014) we fix R = 100 for BLB. We allowed the 
competing methods to run for 60 seconds. 


4.2 Logistic Regression 

Consider a d-dimensional multiple logistic regression model 

Hi Ber(pi) where pi = + ... + /3 d x itd 


for i = 1 ,n. Our parameter of interest is the d-dimensional vector of slope co- 
efficients, whose true value is 3 — (/3i,..., /3 d ) = (1,...,1)'. We use the maximum 
likelihood estimator j3 n which does not have a closed form expression for this model, 
but can be numerically computed using a Newton-Raphson method. We use the R 
function glm for fitting the model. As before, we want to construct a simultaneous 
95% confidence region for /3. Define the root function 


T n 0,0) = 0-0^0-0) 

where E = YH=i [i+clpp-'c)] 2 XiX i- ^°- 95 ^ ie ^5% quantile of the true (unknown) 

distribution of T n ((3,/3). Then the confidence region is given by {/3 : T n ((3,f3 ) < go. 95 }- 
In general the true distribution of T n , and hence its quantile qo. 95 , is unknown. But it 
can be estimated by the resampling techniques described in the previous section, with 
£(Q n ) — Qo .95 as the target precision parameter, where Q n is the true distribution of 
T 

± 71 * 


We generate Xij ~ t 3 and obtain a numerical approximation of go. 95 using 10000 
Monte Carlo simulations. As before, we define the error rate as | go. 95 /go. 95 — 1|. We use 
subset size b — n_ with 7 = 0.6, 0.7, 0.8 for both BLB and SDB, and use n=100000, 
d=10. Following Kleiner et ah (20141 we £x R = 100 for BLB. We allowed the com¬ 
peting methods to run for 20 seconds. 
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4.3 Comparison of SDB, BLB, and Bootstrap 


The methods are compared with respect to the time evolution of error rates. Note that 
this is different from conventional analysis where error rates from competing methods 
are compared for the same number of iterations. This makes sense because different 
methods have different estimation time profiles (as formulated in Table [1]), and we want 
to investigate which method is the fastest to produce reasonably accurate results. We 
consider a time grid 1 , 2 ,..., 60 (in seconds) and at each time point t, we look up the 
latest iteration that was completed by this time, and calculate the error corresponding 
to the estimate £ from cumulative iterations including that iteration. For each method 
and any t, this can be interpreted as the error rate obtained by that method for a 
given computation time budget of t seconds. Different methods will have different 
numbers of iterations completed within the same time budget. For bootstrap and 
SDB, latest iteration m eans t he l atest completed resample or subset-resample, while 
for BLB (following Kleiner et ah! 1 2014 )) the latest iteration means the latest completed 
subset. Note that till the first iteration is complete, we do not have an estimate £, so 
we consider the error rate to be 1 till the first iteration is completed. Error rates are 
averaged across 20 Monte Carlo simulations. 

Figure |T] shows the time evolution of error rates for bootstrap, BLB, and SDB. 
Bootstrap has the highest computing cost which gets reflected in its slow convergence. 
The performance of BLB and SDB are close to one another for generous time budgets, 
but for lower time budgets SDB performs better by quickly giving a reliable estimate 
while BLB takes some time to complete the first subset. This phenomenon becomes 
particularly prominent for higher values of b as BLB’s computing time for each subset 
becomes large. For small time budgets even bootstrap can beat BLB when b = n 0,8 , 
since the time taken by BLB to complete a subset can exceed the given budget. A 
similar phenomenon for small time budgets was observed in the simulation study of 
Kleiner ct al. ( 2014 1 (see Figure 1(a)—(c) in their paper for linear regression and 2 
(a)—(c) for logistic regression), where bootstrap estimates are available but BLB esti¬ 
mates are not available yet for subset size b = n 0 ' 8 or b = n 0,9 . 


Remark 4 . 1 . Computing time for the resampling methods depends on various aspects 
of the computational infrastructure used, e.g. the processing power of the computer, 
storage capacity, and statistical software or computing platform. All our simulations 
were performed on a desktop computer with Intel(R) Core(TM)2 Duo CPU E8400 
@3.00 GHz processor and 4 GB RAM, running R version 3.0.1. The computational 
infrastructure influences the computing time of various resampling methods in identical 
manner, so the relative performance of these methods should be qualitatively similar 
in a different infrastructure, even if the absolute performances might vary. 


Remark 4 . 2 . It is relevant to note that while we have used models from Kleiner et al. 


(2 0141 ) in these studies, the precision measure and the method of comparison between 
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resampling schemes are slightly different. They constructed marginal confidence inter¬ 
vals for the individual regression coefficients, and combined results for the d coefficients 
by averaging the error rate over the dimensions. Thus they are interested in measures 
of precision of the individual estimation tasks of estimating the d coefficients. However, 
in a multivariate regression setting, the joint estimation task of all coefficients taken 
together might be of more interest. Accordingly, we constructed a simultaneous confi¬ 
dence region for the d-dimensional vector of regression coefficients to assess precision 
of this joint estimation task, and compute error in terms of this_confidence region. 

For comparison between resampling schemes, Kleiner et ah (2014 ) allowed the com¬ 
peting resampling schemes to converge, and for each iteration (defined as a complete 
subset for BLB and resample for bootstrap), they computed the average cumulative 
computing times and average error rates from five Monte Carlo simulations. They 
compared resampling schemes on the basis of this average time vs average error trade¬ 
off. In our simulations, we compare methods on the basis of error rate achieved for a 
given time budget, over 20 Monte Carlo simulations. Thus time is not averaged across 
simulations — rather, at a fixed point in time we look up the error rates obtained by 
this time in the Monte Carlo simulations, and average them. If at a certain time point 
no estimate is available yet (no iteration has been completed), we assign an error rate 
of 1. 

In particular, in our simulation plots, the error rate changes only when an iteration 
has been completed. This makes them look ‘jerky’ and unstable as there are long 
stretches of a flat line followed by a sudden drop. Since estimates change only upon the 
completion of a new iteration, the arrival of new estimates is actually an intermittent 
process rather than a continuous process with respect to the time axis, and the error 
rate does not change unless a new estimate is available. Therefore it is realistic that 
error rates change in a ‘jerky’ fashion rather than smoothly, and this is not a symptom 
of instability. 


Remark 4 . 3 . Several bootstrap approaches exist in a regression setting — for example 
paired bootstrap, residual bootstrap, wild bootstrap and so on. In these simulations 
we have implemented the paired bootstrap, where both regressors and response are 
resampled. The paired bootstrap method can be naturally extended to the BLB and 
SDB algorithms, but it is unclear whether there are straightforward extensions for 
residual bootstrap or wild bootstrap. 

As pointed out by a referee, another alternative is to look at bootstrap p-values 
instead of confidence regions for regression models. In our formulation £ is a parameter 
associated with the sampling distribution Q n of the root function T n (which in this case 
is the F-statistic), while the p-value is a statistic. However, one can implement Algo¬ 
rithm H] to ‘estimate’ the true p-value using the empirical distribution Q n< b,s- Limited 
(unreported) simulation results suggest that SDB still possesses the same advantage 
over BLB and bootstrap, which is reported for confidence region. A more careful in¬ 
vestigation regarding the suitability of SDB for approximation of the p-value in theory 
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and finite sample simulations is left for future work. 
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Figure 1: Time evolution of error rates for multiple linear regression with d=100 
(left column) and multiple logistic regression with d= 10 (right column). Sample size 
n=100000, subset size is b = n 1 where 7 = 0.6 (top row), 7 = 0.7 (middle row) and 
7 = 0.8 (bottom row). Bootstrap errors are represented by solid lines, BLB errors by 
dashed lines, and SDB errors by dotted lines. Errors are averaged over 20 simulations. 
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5 SDB for time series data 


In this section, we extend SDB to time series data. Note that Kle iner et ah ( 20141 ) 


have briefly mention ed an exten sion of BLB to the time series setting using stationary 
boo tstrap (Politis and Romano ( 1 994bli '). however no rigorous theory is provided. Also 


Laptev, Zaniolo, and Lu (2012) for a recent implementation on a large Twitter 


see 
dataset. 

Suppose we observe X n = {A t }" =1 , which is a stretch of length n from the strictly 
stationary time series {X t } teZ , and let P denote the joint probability law that governs 
the stationary sequence. Let 9 = 9(P) be our parameter of interest, and suppose we 
have an estimator 9 n (X n ) which is a measurable mapping from X n to R. As with 
independent data, we are interested in evaluating the precision of this statistical infer¬ 
ence. As before, this can be formulated in terms of a root function T n (9 n ,6 ), and the 
precision can be expressed as £(Q n ) where Q n is the true (unknown) distribution of 
T 

For BLB, we first construct subsets of the original sample by randomly choosing 
a continuous stretch of data X* h — {X J+i }^ =1 where 0<J<n — b+1 and the 
subset size b is fixed beforehand. From the j th subset, we then construct R weighted 
resamples (X*\, W* !f k> ) for k = 1 ...., R of size n using the Moving Block Bootstrap 
(MBB, hereafter) of Kiinschl 1 19891 and Liu and Singh ( 1992 1. We use MBB instead of 
stationary bootstrap used by Kleiner et a l. (2014) since the MBB is conceptually sim¬ 
pler, and easier in terms of theoretical treatment. For this, we consider some suitable 
block length L < b and divide the subset {Xj,X J+lj ..., Xj +b _{\ into an ensemble of 
overlapping blocks {X A, W+i, • • •, -W+l- 1 } where J<i<J + b — L + 1. The resample 
is constructed by concatenating blocks that are randomly sampled from this ensemble, 
till we obtain a chain of size n. Note that when n is not a multiple of L , we will need to 
take a fraction of the final block in order to obtain a resample of length exactly n. This 
gives us an ensemble of roots of the form T n (9(X* j b , W*^ J ’ ' ) ), 9 b (X* j b )), k = 1,..., R, and 
we use the empirical distribution of this ensemble to approximate the unknown distri¬ 
bution of T n (9 n , 9). Averaging over j = 1,..., S subset estimates then gives the BLB 
estimate of precision. 

For SDB, for each subset X*-* b , we generate only one MBB resample (X* 3 hl W* ( '^ 1 ' 1 ) 

to construct the root T n (X* j b , 9 b (X* j b )). We do this a large number ( S ) of 

times to generate an ensemble of roots, and use the empirical distribution of this 
ensemble to approximate the unknown distribution of T n (9 ni 9). Algorithm [2] outlines 
the computational steps involved. 

In the time series case, estimation time can be formulated as t(m) in a manner 
similar to Section 12.11 where m is the number of distinct points, and the estimation 
times listed in Table |T] apply with MBB taking the place of bootstrap. MBB requires 
estimation on the original data and its R resamples. Each BLB subset requires esti¬ 
mation on the subset and its R resamples. Each SDB subset requires estimation on 
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Input : Data X n = {Xi,... ,X n } £(•): measure of accuracy 

9: parameter of interest b: subset size 

9 n : estimator L: block length 

T n ( 6 n , 9): root function S: number of subsets 

Output: £ (Qn,b,s)■ Estimate of £ 

for j <— 1 to 5 do 

(i) Choose random subset X* 3 b = {Xj +i }^ =1 from X n where 0<J<n — b + 1 

(ii) Compute 9(X*\) from X*\ 

(iii) Choose k — n/L blocks by randomly sampling k starting points 
(fi,..., tk) from { J + 1,..., J + b — L + 1} with replacement 

(iv) Construct resample weights for subset: 

Initialize: 1} <— (^^0,) 

b 

for i <— 1 to k do 

w-f + ) 

t% — 1 L b-ti—L-\-l 

end 

(v) Compute resample estimate 6(X* j h , from (X* 3 h , 

(vi) Compute resample root: = T n (9{X* 3 b ,yV*^ l) ),9(X* 3 b )) 

end 

1 . Compute empirical distribution of roots: Q n ,b,s = ecdf of {i?* 1 ,..., R* s } 

2 . Calculate the plug-in estimator £(Q n ,b,s) 

Algorithm 2: SDB time series algorithm 


the subset and the single resample. 

Broadly speaking, the time series version of SDB retains the advantages discussed 
in Section 12.11 in the context of independent data. For a given computational time 
budget, by using a single resample for each random subset SDB can accommodate 
much more comprehensive coverage of data than BLB. Also, BLB involves tuning 
parameters R and S (or e under the adaptive method) whose selection can be non¬ 
trivial, while SDB and MBB do not require this type of tuning parameter selection. 
However, an important tuning parameter in the time series setting is the block length 
L which can affect both variability and the accuracy of the estimate of precision in all 
three resampling methods. 
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6 Theory for dependent data 


We begin by setting up a mathematical framework for SDB in the dependent case. 
Throughout this section we assume that the observations X±,X n stem from a strictly 
stationary time series {X t } te z- Given a sample Ad, X n , the SDB procedure for time 
series can be described through the following steps. 

1. Pick a random variable J which is distributed uniformly on 0,...., n — b — 1. This 
corresponds to the first step of randomly selecting a block of length b from the 
complete data. 

2. Choose K = \n/L\ random variables si,...,sx which are i.i.d. and distributed 
uniformly on 0, ..., b — L + 1. Generate the sample Xf, ..., X* by setting 

( X kL+l, -, X (k+l)L) : = ( X J+s k , ■~,Xj+s k +L- 1 ) 

3. After the first two steps above, one realization of the SDB process is given by 

1 n , b 

v i =i j =i 

Repeat a large number of times, each time generating a new J, si,..., sk- 

As in the case of independent observations, our aim is to establish validity of the 
SDB for general classes of functions. In contrast to the i.i.d. setting, where the 
‘classical’ bootstrap for empirical processes is well understood, there are very few results 
on bootstrap validity for general empirical processes. All of the available results rely 
on the notion of /3-mixing to measure dependence. More precisely, the k’th /3-mixing 
coefficient /3 (k) with k G N is defined as 

m~l sup £ |P(4nBj)-P(4)P(Bj)l 


where the supremum is taken over all finite measurable partitions (Aj)ig/, (of 
cr(A t,t < 0) and a(X t ,t > k ), respectively. As of this writing, we are aware of only 
three articles that deal with bo otstrap validity for general empirical processes based 
on dependent observations. Biihlmann (119951 ) considers the moving blocks bootstrap 
under exponential decay of the /3-mixing coefficients and classes of functions with poly¬ 
nomial bracketing numbers. Radulovic (199 61) establishe s the v alidity of the moving 


blocks bootstrap for VC [see van der Vaart and Wellner (1996), Chapter 2.6.2 for a 


definition] classes of functions unde r conditions on polynomial decay of /3-mixing co¬ 
efficients. Finally, Radulovic (120091 ) revisits the disjoint blocks bootstrap and proves 
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its validity under generic conditions on the function class and decay of /3-mixing co¬ 
efficients. The latter paper also contai ns a nic e ov erview of literature on bootstrap 
validity under dependence [see also Radulovif] ( 2002 1 for a review of e_arlicr results]. 
Our main result can be viewed as an analogue of Theorem 1 in Radulovifl ( 199(1 ). 


T heor em 6 . 1 . Assume that IF is a permissible [as defined on page 228-229 in Kosorok 
( 200&) 1 VC class with envelope function F such that E[F p (Ah)] < oo for some p > 2. 
Assume that the mixing coefficients /3 satisfy {3(h) < k~ q for some q > p/(p — 2 ). If 
additionally there exist K > 0 ,7 > 0, 0 < p < 2( P JU such that b, L satisfy 


n 


= o{n~ K ), L^r oo, L = 0(b p ), n 1/2 = 0(b^~ lh ), 


we have 


G [[b ^ G 

’ j,s 


in £°°(F) where G denotes a centered Gaussian process with covariance structure 
E[G(/)G( 9 )] = ^WPfOsPt,)] - EI/fXOlEbfX,)]). 


tez 


A proof of this theorem is in the mathematical appendix [Section 110.2] . Theorem 
16. II shows that the time series version of SDB also works in a wide range of settings. In 
particular, the continuous mapping theorem and delta method for the bootstrap can be 
employed in the same fashion as discussed at the end of Section [3] We conjecture that 
the assumptions on the dependence can be weakened if we consider more specialized 
classes of functions, such as indicators of rectangles which would lead to the ‘classical’ 
empirical distribution function. 


7 Simulation study for time series 

In this section we report the numerical performance of SDB, BLB, and MBB in two 
simulation studies involving large time series data. 

7.1 Median of AR(1) process 

Consider an AR(1) time series formulated as 


A \ — pX t ~i + e t 

of length n = 100, 000 and random innovation et ~ IV(0,1). The parameter of interest 
is the population median M. We define T n = s/n(M n — M) where M n is the sample 
median. We are interested in evaluating the precision of the estimator M n . Our 
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measure of precision is a quantile of the distribution of T n , i.e. £ = q a (T n ) which can 
be used for constructing confidence intervals, for example, with a = 5%, 95% we can 
construct a 90% confidence interval. 

We obtain the ‘true’ value £ true from 10000 simulations. The error rate is measured 
by | £/£trite — 1 | ■ We implement and compare the three resampling methods, namely 
MBB, BLB, and SDB. Block length is L — 10, 20, 50 for all methods, and we use subset 
sizes b = 5000,10000 for BLB and SDB. We allow each method to run for 60 seconds 
for L = 20, 50 and 120 seconds for L = 10, to allow BLB to complete one subset. 


7.2 Time Series Regression 


We also studied the relative performance of MBB, BLB, and SDB in the time series re¬ 
gression framework (se e e.g. [Andrews and Monahan (1992), Kiefer. Vogelsang, and Bunzel 
( 2000 1. Rho and Shao ( 20131) ). Consider the time series regression model 


Ut — + u t 


for t — 1 ,,n where (3 is a d x 1 vector of regression coefficients, X t is a d x 1 vector 
of stationary regressors, and u t is a stationary error process tha t satisfies E [tq | X t \ 
= 0. We considered the AR(l)-HOMO regression model of lAndrews and Monahan 
(119921) where the d regressors and the error process are mutually independent, mean 
zero, homoskedastic, AR(1) processes with autocorrelation p and standard normally 
distributed innovations, and set j3 — 0 . We set n = 100, 000 and d = 10, and use 
p = —0.8, 0.5, 0.9. Similar to Section 14.11 the parameter of interest is /3, estimator of 
choice is the least-squares estimate /3, and we measure precision by constructing a 95% 
confidence region for j3 using the F-statistic. We obtain the ‘true’ value of go .95 from 
10000 simulations. We define error rate by |^o. 95 /t 7 o .95 — 1| as before, and use block 
lengths L = 10,20,50, subset sizes b = 5000,10000 for BLB and SDB. To allow BLB 
to complete one subset, we ran each method for 150 seconds for L = 10, 90 seconds 
for L = 20, and 60 seconds for L = 50. 


7.3 Comparison of SDB, BLB, and MBB 

The methods are compared with respect to the time evolution of error rates, as dis¬ 
cussed in Section 14.31 Error rates for different methods are averaged across 20 Monte 
Carlo simulations. Results for L — 50 are displayed in Figures [2] |3j and [4j To save 
space, results for L = 10, 20 are presented in supplementary materials. We can see 
that SDB shows significant advantages over its competitors. In particular, it is encour¬ 
aging to observe that for shorter time budgets (half of total runtime or less), SDB has 
a clear advantage over the other methods in most cases. SDB can give a reasonable 
estimate by 10-15 seconds in most cases, while the BLB can take a substantial time 
to complete a single subset. MBB has highest computing cost which is reflected in its 
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slow convergence, but it appears that MBB can often provide a reasonable estimate 
by the time taken by BLB to complete one subset, which is consistent with our finding 
in the iid case. 

An interesting aspect of these results is that block length affects both running time 
and accuracy. The behavior of the resample estimate depends on block length, and 
this affects accuracy of the resampling methods. The dependence of running time on 
block length comes from the fact that construction of resample weights (Step (iv) of 
Algorithm [2]) depends on the number of blocks in the resamples, and this step affects 
running time of the algorithms, ffowever, the advantages of SDB in our numerical 
results are consistent over the various values of block length used. 


8 Data Analysis 


We apply our method to analyze the Central England Temperature (CET) dataset, 
which is a meteorological time series dataset consisting of 228 years (1780-2007) of av¬ 
erage daily temperatures in central England. The CET dataset represents the longest 
continu ous t hermometer- based te m peratu re record on earth, an d was previously ana¬ 
lyzed by Z ha ng, Shao, Havhoe, and Wuebblesj (2011) and lBerkes. Gabrvs. Horvath, and Kokoszka 
( 2009( 1 in the context of inference for functional time series. In our analysis, we treat 
the dataset as an univariate time series sample of daily average temperatures. The 
sample size is n = 228 x 365 = 83220, where we ignore leap years. We remove season¬ 
ality by subtracting from each observation the mean temperature for that calendar day 
across 228 years. Our parameter of interest is the population mean /i of this univariate 
time series. We use sample mean X (calculated from the n = 83, 220 observations after 
removing seasonality) as the estimator of /i, and we want to construct a 90% confidence 
interval for /i to assess the quality of estimation. We define T n = y/n(X — fi) as the 
root function, and let the precision measure £ = (g 0 .95 — Qom) be the width of the 90% 
confidence interval. 

We applied MBB, BLB, and SDB on this dataset with block length L = 10, 20, 50 
and subset size b = 5000,10000. MBB was allowed to run for 600 seconds, while BLB 
and SDB were allowed to run for 300 seconds. Figure [5] displays the time evolution of £ 
for the competing methods. Note that in this empirical example the true width is not 
known, however it appears that for any given block length, the three methods converge 
to similar estimates of the width. MBB is the slowest to converge, and continues to 
display substantial oscillations well after BLB and SDB have stabilized. BLB and SDB 
quickly converge to stable estimates, but for small time budgets, SDB stabilizes faster. 
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Figure 2: AR(1) simulation results with £ = 95% quantile of T n = y/n(M n — M), 
sample size n=100000, block length L=50, autocorrelation p = -0.8 (top row), 0.5 
(middle row), 0.9 (bottom row), and subset size b = 5000 (left column) ,10000 (right 
column). The plot displays the time evolution of error rates from 20 simulations when 
each method was allowed to run for 120 seconds. MBB errors are in solid lines, BLB 
in dashed lines, and SDB in dottted lines. 
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Figure 3: AR(1) simulation results with £ = 5% quantile of T n = yjn(M n — M), 
sample size n=100000, block length L=50, autocorrelation p = -0.8 (top row), 0.5 
(middle row), 0.9 (bottom row), and subset size b = 5000 (left column) ,10000 (right 
column). The plot displays the time evolution of error rates from 20 simulations when 
each method was allowed to run for 120 seconds. MBB errors are in solid lines, BLB 
in dashed lines, and SDB in dottted lines. 

9 Discussion 

In this article we present a new resampling method, called subsampled double bootstrap 
(SDB), for estimating the precision of inference methods in massive data. Our method 
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Figure 4: Time series regression simulation results with £ = 95% quantile of T n = 
MSM/MSE , sample size n=100000, dimension d = 10, block length L=50, autocor¬ 
relation p = -0.8 (top row), 0.5 (middle row), 0.9 (bottom row), and subset size b = 
5000 (left column) ,10000 (right column). The plot displays the time evolution of error 
rates from 20 simulations when each method was allowed to run for 60 seconds. MBB 
errors are in solid lines, BLB in dashed lines, and SDB in dottted lines. 


applies to both independent data and stationary time series data. The main idea is to 
select small random subsets of the data and construct a single full size resample from 
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Figure 5: Time evolution of £ for CET dataset (measured in Celsius), where £ = 
(qo. 95 -qo. 05 ) is the width of the 90% confidence interval for /x based on T n = y/n(X— /x). 
MBB was allowed to run for 600 seconds and BLB, SDB for 300 seconds. Block length 
L=50 (top row), 20 (middle row), 10 (bottom row), and subset size b = 5000 (left 
column) ,10000 (right column). MBB estimates are in solid lines, BLB in dashed lines, 
and SDB in dottted lines. 


eac h random subset, in a manner rem iniscent of fast double bootstrap ( White ( 2000h 
and lDavidson and MacKinnon (2000)). Our method inherits the theoretical strengths 
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and automatic nature of classical resample b ased methods like bootstrap (Efron ( 1979h ) 
in the independent data context and MBB ( Kiinsch f 1989 1. Liu and Singh ( 19921) ) in 
the time series context. It also inherits the computational strengths of subsample 
based methods like subsa mpling (IPolitis and Romanol (J1994aJ)) and m out of n boot¬ 
strap (Bickel et al. ( 19971)1 . The advantage of our method over the recently proposed 
BLB ( Kleiner et al. ( 20141) ) lies in sample coverage, running time, and automatic im¬ 
plementation without having to choose additional tuning parameters under a given 
time budget. Simulation studies and data analysis examples demonstrate the advan¬ 
tage of our method over BLB and boostrap (i.i.d. case) or MBB (time series case) for 
a given computational time budget. 

An important practical aspect of both SDB and BLB is the choice of subset size. 
Increasing the subset size leads to increasing benefits in terms of statistical accuracy but 
at an increasing computational cost. In the time series case, the regularity conditions 
of Theorem 16.11 impose some restrictions on b for consistency. In practice, for a given 
computational time budget, it remains unclear how to choose an optimal subset size 
that balances statistical accuracy and running time. A closely related problem is 
the selection of optimal block length for the time series version of SDB and BLB. 
In the context of classica l resampling methods this prob l em h as been well studied by 
Hall. Horowitz, and Jing ( 1995 ). Biihlnmnn and Kiinsch ( 1999 1 among others. For the 
time series version of SDB and BLB, we conjecture that the choice of optimal block 
length is associated with subset size in addition to sample size and other parameters. 
We leave these interesting directions to future work. 

Further, it is worth mentioning that the higher order accuracy of BLB was studied 
by 
by 


Kleiner et all (1. 2014) and higher order accuracy of FDB has been recently studied 


Chang and Hall (201 4) . A relevant next step is a theoretical comparison of SDB 


and BLB which will help identify scenarios where SDB works better than BLB, or vice 
versa. This comparison will involve studying higher order properties of SDB, and we 
plan to consider this in future research as well. 

10 Appendix: Proofs of theoretical results 


10.1 Proof of Theorem 13.1 


We begin by setting up a probabilistic model for the SDB in the i.i.d. setting. When 
dealing with empirical processes which are defined on classes of functions, measura¬ 
bility questions play an im porta nt role and it is crucial to state what the underlying 


probability space is- see Dudley ( 1999|) . Chapter 3.1 (page 91), for a discussion of 


related matters. Here we will consider the following setup. 


(P) Consider a product of three probability spaces (fl”, A™, Assume that 

the observations Xi,..., X n are defined as coordinate projections on (fl™, M™, P”), 
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which is itself a product of n identical probability sp aces [t his is a standard 
assumption in empirical process theory- see for instance Dudley! (1999), Chapter 
3.1]. Additionally, assume that on f Xf we have a random vector (W^ n , ..., Wj, ,n) ~ 
Multinomialf ) (n, 1/6,..., 1/6) and that on fig we have a random variable R which 
follows a uniform distribution on the permutations of {1, In what follows, 

denote the set of permutations of { 1 ,..., n) by a{n). Also, we assume without loss 
of generality that Qf are finite for i — 2,3 and that for i — 2, 3 the sigma-algebra 
A™ is the power set of D/. 


Throughout this proof, write W for the vector (Wi jn ,..., W& )Tl ) and X for the vector 
(Ad, ...,X n ). Define the map 


/ ^ (. Z n (R,X,W))(f ) := —j= ~ n /b)f(X R -i(i)), feX. 

\ n f—' 


2=1 


Note that (Z n (R,X,W))(-) can be viewed as an element of the space of functions 
£°°(X). Throughout the proof, denote by BLi the set of Lipschitz continuous functions 
g : £°°(X) —> M with Lipschitz constant 1 that are additionally uniformly bounded by 
1. Also, use the notation /*, /* to denote smallest measurable majorants and greatest 
measurable minorants, respectively. For maps of several arguments, it will sometimes 
be necessary to take measurable majorants and minorants with respect to only some 
of the arguments. For example g(r, X,W)* X ’ W will be used to denote the smallest 
measurable majorant of the map (x,w) i— > q(r , x, w) with r being held fixed. With this 
notation, we need to show that [see Kosorokl ( 2008 1. Section 2.2.3] 


(i) 


sup 

/iGBLi 


E R ,wh(Z n (R,X,W))-E[h(G)\ 


^ 0 . 


(ii) For all h G BLi 

E R , w h(Z n {R, X, W)Y - E R , w h(Z n (R, X, W))* ^ 0. 


Here, E R W denotes the expectation with respect to R , W. Note that the map (R, W) (->• 
h(Z n (R, X, W )) is measurable outer almost surely since R , W are defined on complete, 
discrete probability spaces. 

Proof of (i) Write 


K RiW h(Z n (R,X,W)) = - V E w \h(Z n (r,X,W)) 

n\ L 


r*G<r(n) 
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Then 


E 


'A' 


< ®a 


< E x 


1 


sup E RtW h(Z n (R, X , W)) - E[/i(G)] 

L /isBLi 

V sup Ely h(Z n (r, X, W)) - E[/i(G)] 

Ln! ■f—' heBU 
r€a(ri) 


A v e 

n\ ^ 

rG<x(n) 


W 


sup 

/iGBLi 


— V' E* 

n\ ^ 


sup 

/iGBLi 


rEcr(n) 

For each fixed value of r we have 


h(Z n (r,X, W))-E[h(G)} 
h(Z n (r , X, W)) — E[/i(G)] 


*x,w 


{Z n (r , X, W))(f) = 


-l=Y,(Wi, n -n/b)f( X r -i (i) ), 


which implies that Z n (r, X, W) depends on X only through (X r -i^\) i=1; _ ;b . In partic¬ 
ular 

sup h(Z n (r , x,w)) — E[/i(G)] —So II r (x, w) 

/iGBLi 

where II r (x, w) := ((av-ip), ...,x r -i(b)),w) and we defined for y,w G M b 

b 


S(y, w) := sup h(f^^=y'(w i -n/b)f(y i )\- 

he BLi V V ' n J 


E[/i(G)] 


Since X\ ,...,X n ,W are defined on a product probability space, it follows that the 
measurable majorant of S'oII, r (A", W) with respect to A", W can be expr essed a,s S(■,■)*o 
hhlx, w). This is a consequence of from Lemma 1.2.5 in Ivan der Vaart and Wellner 
( 1996h and combined with the fact that S is uniformly bounded and II r is a coordinate 
projection on a product space. In particular, the symmetry of the problem implies 
that 


n\ 


Y E* 


rGo-(n) 


sup 

L /iGBLi 


h(Z n (r, X, W))-E[h(G)} 


_1 

* 

sup 

J 

-heBLi 


h{Z n { id, A, W))-E[h(G)} 


where id := (1, 2,..., n). Moreover Theorem 3.6.3 in van der Vaart and W ellne r (119961 ) 
with the identification k n — n,n — b implies that 


2 > sup 
he BLi 


h{Z n { id, A, W))-E[h(G)} 


^ 0. 
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By dominated convergence, this yields 


E* 


sup 

L h £BLi 


h(Z n ( id, AT, W)) - E[/i(G)] 


0, 


and thus statement (i) is established. 


Proof of (ii) 

It suffices to prove that [note that h* — h* > 0] 

E [h(Z n (R, X, W))* - h(Z n (R, X, W))*\ ->• 0. 

Observe that by the definition of measurable majorants we have 

h{Z n (R,X,W))*=( E I{R = r}h{Z n (r,X,W))y 

r£a(n) 

< E I{R = r}(h(Z n {r,X,W))y X ’ W , 

r£a(n) 

/ \ *x,w 

since (R,X,W) i->- I{R = r} ( h(Z n (r, X, W ))) is measurable for each r e cr(n). 
Similarly 


h(Z n (R,X,W))*= ( E I{R = r}h(Z n (r,X,W))y 


rGcr(n) 


> E I{R = r}[h(Z n (r,X,W 

rGcr(n) 


*x,w 


Thus 


E [h{Z n {R, X, W))* - h{Z n {R , X, IV))*] 


< e| E I { R = r}((h(Z n (r,X,W))y X ' W ~(h(Z n (r,X,W)) 


rEcr(n) 


*X,W y J 


= E 


(h(Z n (id,X,W))y x ' W - (h(Z n (id,X,W))) 


*x,w 


where the equality in the last line follows by arguments similar to the ones given in the 
proof of (i). Now the expression in the last line converges to zero b y arguments similar 
to the ones given in the proof of (i) and Theorem 3.6.3 in Ivan der Vaart and Wellner 
( 1996h and thus (ii) follows. □ 













10.2 Proof of Theorem 16.11 


Throughout this proof, we will simplify notation by assuming that KL = n. It is easy 
to see that this assumption can be relaxed. 

Introduce the abbreviation S = (si,..., sk), X = (Ad, X n ). Denote by Ps.j the 
probability conditional on X and by Ps the probability conditional on X, J. Similarly, 
let lEg^j, E 5 , Vargj and Vars denote the corresponding versio ns of condi t ional expecta¬ 
tions and variances. Following the discussion on page 277 in Radulovic] ( 1996t h we will 
assume that all suprema we encounter are measurable. This might not always be true, 
but permissibility of T ensures that suitable modifications of our arguments remain 
correct [see the discussion on page 277 in Radulovic (1996)]. 


Define the norm 


hX '■= (E[|/(Xi)| p ]) 1 / p . For an arbitrary 5-net Pg for T with 
respect to || • || Pi _y, denote by fg any point in Tg which minimizes ||/ — g\\ p ,x over g £ Tg. 
Consider the approximating processes A f n (f) b (fs), &s(f) := G( fs ). By Lemma 

B.3 in Volgushev and Shaol (2014) the claim of the Theorem follows once we establish 
that 

(i) For every i £ N: Aw jn -w A x/i for n —> 00 . 

! JjS 


(ii) A i/i ^ G for i —> 00 . 

(iii) For every e > 0: liin,_ 


) limsup n ^ 00 P(sup /e _ F |Af /i>n (/) - >e)=0. 


Part (ii) follows from the properties of the limiting process G [more precisely, there 
exists a version of G with sample paths that are uniformly continu ous w ith respe ct to 
II • 11 2 ,v and thus || • || P) x for p > 2 - see Theorem 2.1 in Arcones and Yu ( 19941 )]. It 
thus remains to establish (i) and (iii). 


In order to establish (i), it suffices to show that for any fixed, finite collection of 
functions fi,fk £ T we have 




\ b (fl),-,®n lb (fk)) f> s (GCA),..., G(/ fc )). 


( 1 ) 


Denote by BLi the set of functions on M. k which are bounded by 1 and are Lipschitz 
continuous with Lipschitz constant bounded by 1. In order to establish ([T]), we need 
to prove that 


sup Eyj h(G^ b (fi), ..., &n,b(fk)) ~ IE[/^(G(/ 1 ),..., G(/ fc ))] =op(l) 

he BLi 


( 2 ) 


Due to the independence between J and X and due to strict stationarity of {X t } te z, the 
distribution of the tuple (Xj, ...,X J+b _i) is the same as the d istr ibution of (A" 1; ■■■,X b ) 
[unconditionally]. Thus the arguments on page 272 in Radulo vic (119961) yield 


(G£ 6 (.A),...,GJ 6 (/*)) f (G(/i),...,G(/ fc )). 


( 3 ) 
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Observe that 


sup 

fteBLi 


Es,/[*«„(/>), G"„(A))] - E[ft(G(/0,.... G(A))] 


< sup EjEg 

hgBLi 


h(GUfi), G",(/*)) - E[fc(G(/,), .... G(A))] 


< Ej sup Eg 

hgBLi 


ft(G®i,(/i), G®,(A)) - E[ft(G(/,), .... G(A))] 


E sup 

/lEBLi 


Thus for any e > 0 

Esj[h(G®,(/i),.... G® t (A))] - E[ft(G(A),.... G(A))] 

h( G®„(A),-,G® t (A)) - E[ft(G(/,),...,G(A))] 
MG® S (A),.... G®AA)) - E[h(G(A), .... G(A))] 


< E sup Eg 

AeBLi 


< e + 2P ( sup Eg 

' AeBLi 


> £ 


since 


By 


ft(G®,(/ 1 ),...,G®„(A)) - E[ft(G(/,),..,G(A))] 


< 2 by the definition of BLi. 


, the probability in the last line of the above equation tends to zero [as n —> 00 ] 


for any fixed e > 0 , and this implies 

E sup E S|J /,(G® s (A),...,G® t (A))-E[fc(G(A),...,G(A))] =o(l). 

AgBLi 

This proves ([2]) and establishes (i). 


Next, let us prove (iii). Fix e > 0. For a function /, define its truncated version 
/ 4 (x) := f(x)I{F(x) < b 1 }. We shall prove that 

sup|G; ^(f)-G^(f)\ = op(l). (4) 

fer 

Additionally, we will apply restricted chaining to show that there exists a sequence of 
sets of functions J„cJ,n 6 N such that 


P 


sup 

f,9^,\\f-9\\ Pl x<S 


GUf) - <^) 


> 2 >£ 


<p( sup \G^ b (f t )-G^ b (g t )\>e)+aS,n) (5) 

V feT n geT,\\f-g\\ P:X <(lnb)-y 2 


where lim ^ 0 lim n ^. 00 £(5, n) — 0 and that 


P( sup \^n,b(f) - G® 6 ( 0 *)| > e) ->■ 0 as n 00 . ( 6 ) 

f£Tn,geT,\\f-g\\ Pl x<(lnb)-z/2 


30 




















Taken together, (fiD-(|6j) imply (iii). 


Before proceeding with the proof, we remark that 
k i l „ 6 


K 


GUn = E E (/(yv 1)i+ ,) - t E /tw ) = : E w>- 

fc=i v i=i j=i fc=i 

Note that by construction, the quantities Vi(f), ..., Vk{/) are independent condition¬ 
ally on J, X. Moreover, for any function / which is uniformly bounde d, we h ave t hat_ 

\ V k ( f) | < 2n~ 1 / 2 L||/|| 00 . Thus by Bernstein’s inequality [Lemma 2.2.9 in va n der V aar t and Wel l ne r 
(1996 


K 


Ps aViW 


k=\ 


> r/j < 2 exp ( — 


2 AV + 2n-V2L||/|| 00 r ? /3 


( 7 ) 


for any v 2 > VarsiVi) [note that by construction Vars(V i) = Varsi} 4) for all k almost 
surely]. Now from the definition of the bootstrap and the fact that KL = n 


KVa r s(14) = l - ^ 
2=1 


(zk 


E /(w+«)) 2 - (5 E Tm E 2 

i=i *=i j=i 


Additionally, due to the independence between J and the original sample and due 
to strict stationarity, the distribution of the tuple (Xj + 1 ,..., Xj + b) is the same as the 
distribution of (X , Xi,) [unconditionally]. A close inspection of the proof of Lemma 
3 in Radulovic (119961 ) [after identifying (n, b) in the latter paper with (6, L) in our 
notation] shows that under the assumption L = o(b p ) for some 0 < p < the 

following two claims are true: 


A n (Q n ) : = (In 6) 3 sup 


Var s 


L 1 / 2 




Var 


L 1 / 2 


E>‘(*‘)) =or( 1) (8) 


2=1 2=1 
for any sequence of sets Q n C T with cardinality 0(n c ) for some fixed c < oo, and 

„ L 


B n := (In 6) 5 


sup Var s 

her:\\h\\ p , x <(\nb )- 3 / 2 


L 1 / 2 


E‘W) =«p(i) 


( 9 ) 


2=1 


where T' := {/ — g : f,g G A}. Note that, when generating the subsamples, Radulovic 
(119961 1 uses ’wrapping’ while we do not. Following the discussion in [Radulovic (119961 ). 
it is easy to see that asymptotically thi s doe s not matter. 

Additionally, equation (14) in Radulovic ( 19961) implies that 


Var 


L 1 / 2 


E A ‘(v,)) <c 0 \\h‘\\l x 


2=1 


( 10 ) 
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for a constant Co which depends only on p and the mixing coefficients j3. 


Proof of (J3J 
Observe that 

1 n 1 b 

SU P I &n,b(f) - ^n,b(/ f ) I < —/= ( F(X* )I{F(X*)>b-y} + T ^ F (X J+j) I{ F (Xj +i )>b-y}) ■ 

1 4=1 J=1 

By the Chebyshev inequality it suffices to show that the expectation of the right-hand 
side of the above inequality is o(l). From the definition of the bootstrap, it is not 
difficult to see that for any i = 1 

0 < E [F (X* )I{F (X*) > 6 7 }] = E [F(X 1 )I{F(X 1 ) > fc 7 }] < ||F|| p ,. Y (P(F(X 1 ) > 6^))^ 

and the right-hand side of the equation above is of order o(6 _( ' p_1 ^ 7 ) by dominated con¬ 
vergence. A similar bound holds for E [F(X J+ j)I{F(X J+ j) > b 1 }]. Thus (J3J) follows 
from the condition n 1 / 2 = 0(6^ -1 ^ 7 ). 


Proof of 05} 

Define ib-\(x) := e x — 1 a n d den ote by || • ||y x the corresponding Orlitz norm [see 
van der Vaart and Wellneil ( 1996 ). Chapter 2.2], Fix 5 > 0 and let k n denote the 
smallest integer such that S/2 kn < (ln6)~ 3 // 2 /2. Successively construct sets Q\ C G 2 C 
... C Gk n which are maximal subsets of T with the property \\f — g\\ p ,x > 2 ~ l 5 for all 
/', g G Gi [here, maximal means that no further element can be added to G% without 
destroying the property that \\f — g\\ p ,x > 2 ~ l 5 for all /,g G Gi}- Observe that the 
cardinality of Gk n is of polynomial order in 2 ~ kn 5 [the cardinality of Gk„ is bounded by 
the packing number, whi ch is polynomial since T is VC- s ee Theorem 2.6.7 and the 
discussion on page 98 in van der Vaart and Wellner (1l996i) ]. and thus of polynomial 
order in n for any fixed 8. Set a{n) := 2 ~ kn 8 and identify the set T n with Gk n ■ Next, 
define the event D n := < 1} and note that P(D n ) —» 1 for n —* 00 [recall 

the definition of A n in (JS}] for any fixed 8, this follows from (JS}. Observe that Ijj n 
is independent of S and that by dehnition of D n we have for any / G T a ,( n ) and any 
77 > 0 


P( |< b (/*)|/ Drt > v) = EE S I { \c B bUt) \ >v} I Dn 


< 2E 


f n n exp - 


Var + In-^Lfrr} 


< 2E 


lD n exp - 


2C 0 \\f\\ 2 pX + (\nb)-* + ln-V 2 Lbrrri 


< 2 exp 


7 ) 


‘2C 0 \\ft\\l x + (\nb)-z + ln-^Lb'yri 
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where the first inequality follows from ((TJ) and the second from the definition of D r 


From the inequality above combined with Lemma 2.2.10 in Ivan der Vaart and Wellneri 
( 19961) [applied with m, — 1] we obtain that for any / € 8F n 

icfn-^W + iWf'Wlx + ilnbry/ 2 ) 

for some constant C that does not depend on /, 8 , n. In particular we obtain for 
sufficiently large n [due to the assumptions on L, b, n] 

G®i,(/‘)/ D „ < C'\\f%, x Vf er n :\\f%, x >(\nb)-V 2 /2. (11) 

V>1 

We now shall apply Lemma 7.1 from Kiev. Volgushev. Dette. and Hallinl (j2014h . In 
the notation of that Lemma, let T := T,d{f,g) := ||/* — g t \\ p ,x,V (In 6)~ 3//2 5 T : = 
ipi,rj = 5, Gf := A careful inspection of the proof of that Lemma reveals 

that (HIT) is already sufficient to obtain the bound 


SUP I ~ G nM)\ I Dn 

f,9^-\\f-9\\ P ,x<S 


<s 1 + 2 sup I K^n-KM)\i Dn 

feT n ,g£T,\\f-g\\ P:X <(lnb)-y 2 

where Si is such that [note that 'ipp 1 (x) = ln(l + x) and that the packing number of 
T with respect to || • || P) x is of polynomial order since T i s VC- see Theorem 2.6.7 and 
the discussion on page 98 in van der Vaart and Welln er (119961) ] 


l|SilU<c' 


1 + | loge|cfe + (8 + 2(In6) 3//2 )(l + | logA|) 


L J (In 6)- 3 / 2 /2 

for some constant C independent of 8,n. To complete the proof of fl5J), observe that 


sup 

f,9^,\\f-9\\ P ,x<S 


P 

< P 

<P{\S 1 \>s) + p( k 
+ 1 -P(D n ). 




B / „t\ 


> 3e 


sup 

f,9^,\\f~9\\p,x<S 


Kb(f) - In n >Ss)+l- P(D n ) 


sup 

feT n ,geT,\\f-g\\ PtX <8nb)-3/2 


Setting £(n, 5) := P(\S] \ > e) + 1 — P(D n ) completes the proof of (J5|). 

Proof of (E]) 

Define P n f := ^^[Li/(^Q) and consider the pseudo-distance d n (f,g) := P n \f — g 
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Observe that to each / e T we can attach a f E J- n such that \\f — f\\ Pt x < (Info) -3 / 2 . 
Let % :={/* — /*: / 6 J} and denote by l~L n an n~ 2 net for % under d n . To each 
h e H, attach a h e H such that d n (h,h ) < nr 2 . Since J 7 is VC, "H n can be chosen 
such that, for n sufficiently large, the cardinality of is bounded by C(P n F) c n c for 
some fixed constants C, c which do not depend on J, S. Define the event [recall the 
definition of B n in (TTOl) ] 


D n : = 


B n V 


n 


1 y nx d - e[f(a',)] 


i— 1 


< l 


and note that by (TTOl) P(D n ) 1 [since under the assumptions of the present Theorem 
P n F — EF[V]] —> 0 in probability] and that ID n is independent of S. Observe that 


sup 


\^n,b(j) ~ ^n, b (gj\ — SUP 

hen 




+ sup 
hen 


K b ( h - h ) 


and that for any function / 


li I — U — 1 

«»,*(/)[ < ^Ei/wii + tL 

V 4 i=\ 1=0 


" ^l/(W + ,)l < 4]; E I/Ml + E I/Ml 

2=1 2=1 


<2r, 3 / 2 -^|/(Xi)|. 

n 

Z=1 

Thus by definition of h 


sup 

hen 


1 ” 

G nb(h — h) < 2n 3//2 sup — |(/i — /r)(X;)| < 2 rF 1 ^ 2 . 


Hence it suffices to show sup fte ^ G ^ b {h) = op(l). To this end, note that on D n the 
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cardinality of 'H n is bounded by C'n c for some constants C', d independent of n. Thus 


P[ sup 

'hen 


< b (h) 


> T 


< P(\ sup 

h&H n 


G?,„(h) > t | n D„ ) + 1 - P(D n ) 


- EE S t-PW*, 


< 


EE S [ ^ h 

heHn 




= E 


> 6 , y mi 


h&Hn 


= Cn c ' E 

by m 

< 2C'n c 'E 

< 2C'n c 'E 


{\G* b [h)\>T}\ 


+ 0(1) 
+ 0 ( 1 ) 

+ o(l) 


supI 3 P s (\G b (h)\ > t) +o(1) 

l /i€W -I 


sup exp 
l heH 


1 

2 


T 


Var sU-^YstihiX*)) + \n~^UPr 


D„ 


exp l — 


' s "PheH Var s (+~ 1/2 Ei=i K x t)) + | n-Wbr/T 

by the definition of % and D n 


D 


+ 0 ( 1 ) 

+ 0 ( 1 ) 


< 2 C'n c E 


exp l — 


T 


2 (In 6)" 2 + |n-V 2 L^ r 


D„ 


+ o(l) 


< 2C’n c ' exp 


+ o(l) 


2(ln&)- 2 + fn- 1 / 2 L^r, 
since (Inn) 2 = o(n _1 +L6 7 ) by the assumptions on L,b,n 

= o(l). 


This shows that sup h6W Gf lb (h) = 0p(l) and completes the proof of 


□ 
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